TCGA Annotations

The goal of this notebook is to introduce you to the TCGA Annotations BigQuery table. You can find more detail about Annotations on the TCGA Wiki, but the key things to know are:

  • an annotation can refer to any "type" of TCGA "item" (e.g., patient, sample, portion, slide, analyte, or aliquot), and
  • each annotation has a "classification" and a "category", both of which are drawn from controlled vocabularies.

The current set of annotation classifications includes: Redaction, Notification, CenterNotification, and Observation. The authority for Redactions and Notifications is the BCR (Biospecimen Core Resource), while CenterNotifications can come from any of the data-generating centers (GSC or GCC), and Observations from any authorized TCGA personnel. Within each classification type, there are several categories.

We will look at these further by querying directly on the Annotations table.

Note that annotations about patients, samples, and aliquots are separate from the clinical, biospecimen, and molecular data, and most patients, samples, and aliquots do not in fact have any annotations associated with them. It can be important, however, when creating a cohort or analyzing the molecular data associated with a cohort, to check for the existence of annotations.

As usual, in order to work with BigQuery, you need to import the Python bigquery module (gcp.bigquery) and know the name(s) of the table(s) you will be working with:


In [1]:
import gcp.bigquery as bq
annotations_BQtable = bq.Table('isb-cgc:tcga_201607_beta.Annotations')

Schema

Let's start by looking at the schema to see what information is available from this table:


In [2]:
%bigquery schema --table $annotations_BQtable


Out[2]:

Item Types

Most of the schema fields come directly from the TCGA Annotations. First and foremost, an annotation is associated with an itemType, as described above. This can be a patient, an aliquot, etc. Let's see how the annotations break down by item type:


In [3]:
%%sql

SELECT itemTypeName, COUNT(*) AS n
FROM $annotations_BQtable
GROUP BY itemTypeName
ORDER BY n DESC


Out[3]:
itemTypeName     n
Shipped Portion  1749
Aliquot          1729
Patient          1380
Analyte          789
Slide            552
Sample           114
Portion          9

(rows: 7, time: 1.0s, 69KB processed, job: job_cRN_-AkWjppd0jy_Ud4lAb4AFQo)

The length of the barcode in the itemBarcode field depends on the value in the itemTypeName field: if the itemType is "Patient", the barcode will be something like TCGA-E2-A15J, whereas if the itemType is "Aliquot", the barcode will be a full-length barcode, e.g., TCGA-E2-A15J-10A-01D-A12N-01.
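
Because the longer barcodes extend the shorter ones, the patient-level part can be recovered from any item barcode. The following is a minimal sketch in plain Python, assuming that the first three hyphen-separated fields always identify the patient, as in the two examples above. The helper name participant_prefix is hypothetical and is not part of gcp.bigquery or any TCGA library.

```python
def participant_prefix(item_barcode):
    """Return the patient-level prefix of a TCGA barcode.

    Assumption: every barcode starts with TCGA-<site>-<participant>,
    so the first three hyphen-separated fields identify the patient.
    """
    fields = item_barcode.split("-")
    if len(fields) < 3 or fields[0] != "TCGA":
        raise ValueError("not a TCGA barcode: %r" % item_barcode)
    return "-".join(fields[:3])


patient = "TCGA-E2-A15J"
aliquot = "TCGA-E2-A15J-10A-01D-A12N-01"

# A patient barcode has 12 characters, a full aliquot barcode has 28.
print("%d %d" % (len(patient), len(aliquot)))

# The aliquot barcode reduces to the patient barcode.
print(participant_prefix(aliquot) == patient)
```

A helper like this can be handy when grouping item-level annotations by patient outside of BigQuery.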

Annotation Classifications and Categories

The next most important pieces of information about an annotation are the "classification" and "category". Each of these comes from a controlled vocabulary and each "classification" has a specific set of allowed "categories".

One important thing to understand is that if an aliquot carries some sort of disqualifying annotation, in general all other data from other samples or aliquots associated with that same patient should still be usable. On the other hand, if a patient carries some sort of disqualifying annotation, then that information should be considered prior to using any of the samples or aliquots derived from that patient.

To illustrate this, let's look at the most frequent annotation classifications and categories when the itemType is Patient:


In [4]:
%%sql

SELECT
  annotationClassification,
  annotationCategoryName,
  COUNT(*) AS n
FROM
  $annotations_BQtable
WHERE
  ( itemTypeName="Patient" )
GROUP BY
  annotationClassification,
  annotationCategoryName
HAVING ( n >= 50 )
ORDER BY
  n DESC


Out[4]:
annotationClassification  annotationCategoryName                                                       n
Notification              Prior malignancy                                                             407
Notification              Alternate sample pipeline                                                    200
Notification              History of unacceptable prior treatment related to a prior/other malignancy  139
Notification              Synchronous malignancy                                                       110
Notification              Neoadjuvant therapy                                                          102
Notification              Item is noncanonical                                                         81

(rows: 6, time: 2.0s, 308KB processed, job: job_nZ2y7s_1hNEd_FPiRPYSwoVPyWc)

The results of the previous query indicate that the majority of patient-level annotations are "Notifications", most frequently regarding prior malignancies. In most TCGA publications, "History of unacceptable prior treatment" and "Item is noncanonical" notifications are treated as disqualifying annotations, and none of the data associated with those patients is used in any analysis.
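
The exclusion rule described above can be sketched in plain Python once annotation rows have been pulled out of the table. This is only an illustration under stated assumptions: each row is a dict keyed by the table's column names, the function name usable_patients is hypothetical, and the example barcodes are made up. Which categories count as disqualifying is ultimately a decision for each analysis.

```python
# Patient-level categories that the text describes as disqualifying in most
# TCGA publications (strings copied from the query output).
DISQUALIFYING_PATIENT_CATEGORIES = {
    "History of unacceptable prior treatment related to a prior/other malignancy",
    "Item is noncanonical",
}


def usable_patients(cohort, annotations):
    """Return the patients in `cohort` that carry no disqualifying
    patient-level annotation. Order of `cohort` is preserved."""
    excluded = {
        a["itemBarcode"]
        for a in annotations
        if a["itemTypeName"] == "Patient"
        and a["annotationCategoryName"] in DISQUALIFYING_PATIENT_CATEGORIES
    }
    return [p for p in cohort if p not in excluded]


# Made-up rows for illustration only (hypothetical barcodes, not real patients).
example_annotations = [
    {"itemTypeName": "Patient", "itemBarcode": "TCGA-XX-0001",
     "annotationCategoryName": "Item is noncanonical"},
    {"itemTypeName": "Patient", "itemBarcode": "TCGA-XX-0002",
     "annotationCategoryName": "Prior malignancy"},
]
example_cohort = ["TCGA-XX-0001", "TCGA-XX-0002", "TCGA-XX-0003"]

# Only the first patient carries a disqualifying category and is dropped.
print(usable_patients(example_cohort, example_annotations))
```

Note that the patient with a "Prior malignancy" notification is kept here; that category is informational under this rule and would be handled separately if an analysis required it.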

Let's make a slight modification to the last query to see what types of annotation categories and classifications we see when the item type is not patient:


In [5]:
%%sql

SELECT
  annotationClassification,
  annotationCategoryName,
  itemTypeName,
  COUNT(*) AS n
FROM
  $annotations_BQtable
WHERE
  ( itemTypeName!="Patient" )
GROUP BY
  annotationClassification,
  annotationCategoryName,
  itemTypeName
HAVING ( n >= 50 )
ORDER BY
  n DESC


Out[5]:
annotationClassification  annotationCategoryName  itemTypeName     n
Notification              Item is noncanonical    Shipped Portion  1741
CenterNotification        Item flagged DNU        Aliquot          1057
Notification              Item is noncanonical    Slide            541
Notification              Item is noncanonical    Analyte          464
Observation               General                 Analyte          179
CenterNotification        Center QC failed        Aliquot          153
Observation               General                 Aliquot          116
Notification              Item in special subset  Analyte          104
Notification              Barcode incorrect       Aliquot          84
Redaction                 Genotype mismatch       Aliquot          80
Notification              Item is noncanonical    Sample           67
Redaction                 Inadvertently shipped   Aliquot          54

(rows: 12, time: 2.9s, 308KB processed, job: job_3Yv7RlfX-Pc9ft0pyOOQarPFD7U)

The results of the previous query show that the most frequent non-patient annotation is the "Item is noncanonical" Notification on shipped portions. At the aliquot level, most annotations were submitted by one of the data-generating centers and indicate that the data derived from that aliquot is "DNU" (Do Not Use). In general, a DNU flag should not affect any other aliquots derived from the same sample or any other samples derived from the same patient.
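
To make the scope of a DNU flag concrete, here is a small plain-Python sketch that drops only the flagged aliquots and leaves every other aliquot from the same patient alone. It assumes annotation rows are available as dicts keyed by the table's column names; the function name usable_aliquots is hypothetical. The two example rows are annotations that this table holds for patient TCGA-61-1916.

```python
def usable_aliquots(aliquot_barcodes, annotations):
    """Return the aliquots that are not themselves flagged DNU by a center."""
    flagged = {
        a["itemBarcode"]
        for a in annotations
        if a["itemTypeName"] == "Aliquot"
        and a["annotationClassification"] == "CenterNotification"
        and a["annotationCategoryName"] == "Item flagged DNU"
    }
    return [b for b in aliquot_barcodes if b not in flagged]


# Two aliquot-level annotations for patient TCGA-61-1916.
example_rows = [
    {"itemTypeName": "Aliquot",
     "itemBarcode": "TCGA-61-1916-02A-01R-0808-01",
     "annotationClassification": "Observation",
     "annotationCategoryName": "General"},
    {"itemTypeName": "Aliquot",
     "itemBarcode": "TCGA-61-1916-01A-01D-0803-01",
     "annotationClassification": "CenterNotification",
     "annotationCategoryName": "Item flagged DNU"},
]
example_aliquots = [
    "TCGA-61-1916-02A-01R-0808-01",
    "TCGA-61-1916-01A-01D-0803-01",
]

# Only the DNU-flagged aliquot is removed; the other aliquot from the
# same patient is kept.
print(usable_aliquots(example_aliquots, example_rows))
```

The match is on the exact aliquot barcode on purpose, so the flag never spreads to sibling aliquots.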

We see in the output of the previous query that an "Item is noncanonical" Notification can be applied to different types of items (e.g., slides and analytes). Let's investigate this a little further by counting these annotations by study (i.e., tumor type):


In [6]:
%%sql

SELECT
  Study,
  COUNT(*) AS n
FROM
  $annotations_BQtable
WHERE
  ( annotationCategoryName="Item is noncanonical" )
GROUP BY
  Study
ORDER BY
  n DESC


Out[6]:
Study  n
OV     743
GBM    519
KIRC   455
COAD   314
LUAD   238
LUSC   231
HNSC   212
READ   115
KICH   47
PRAD   27
CHOL   15
ACC    12
PAAD   8
BRCA   2

(rows: 14, time: 0.8s, 177KB processed, job: job_xMNKI2Ng_GENgTZzFbH0R_OAaUE)

Now let's pick one of these tumor types, OV, and delve a little further:


In [7]:
%%sql

SELECT
  itemTypeName,
  COUNT(*) AS n
FROM
  $annotations_BQtable
WHERE
  ( annotationCategoryName="Item is noncanonical"
    AND Study="OV" )
GROUP BY
  itemTypeName
ORDER BY
  n DESC


Out[7]:
itemTypeName     n
Slide            409
Shipped Portion  220
Analyte          110
Patient          3
Sample           1

(rows: 5, time: 0.9s, 247KB processed, job: job_cALh5aAuWIlhHpmRkCpKDoN8u4c)

Barcodes

As described above, an annotation is specific to a single TCGA "item", and the fields itemTypeName and itemBarcode are the most important keys to understanding which TCGA item carries the annotation. Because we use the fields ParticipantBarcode, SampleBarcode, and AliquotBarcode throughout our other TCGA BigQuery tables, we have added them to this table as well, but they should be interpreted with some care. When an annotation is specific to an aliquot (i.e., itemTypeName="Aliquot"), the ParticipantBarcode, SampleBarcode, and AliquotBarcode fields will all be set, but this should not be interpreted to mean that the annotation applies to all data derived from that patient.

This will be illustrated with the following two queries which extract information pertaining to a few specific patients:


In [8]:
%%sql

SELECT
 Study,
 itemTypeName,
 itemBarcode,
 annotationCategoryName,
 annotationClassification,
 ParticipantBarcode,
 SampleBarcode,
 AliquotBarcode,
 LENGTH(itemBarcode) AS n
FROM
  $annotations_BQtable
WHERE
  ( ParticipantBarcode="TCGA-61-1916" )
ORDER BY n ASC


Out[8]:
Study  itemTypeName  itemBarcode                   annotationCategoryName  annotationClassification  ParticipantBarcode  SampleBarcode     AliquotBarcode                n
OV     Patient       TCGA-61-1916                  Item in special subset  Notification              TCGA-61-1916                                                        12
OV     Analyte       TCGA-61-1916-01A-01R          Item is noncanonical    Notification              TCGA-61-1916        TCGA-61-1916-01A                                20
OV     Analyte       TCGA-61-1916-02A-01T          Item is noncanonical    Notification              TCGA-61-1916        TCGA-61-1916-02A                                20
OV     Analyte       TCGA-61-1916-01A-01D          Item is noncanonical    Notification              TCGA-61-1916        TCGA-61-1916-01A                                20
OV     Analyte       TCGA-61-1916-02A-01R          Item is noncanonical    Notification              TCGA-61-1916        TCGA-61-1916-02A                                20
OV     Analyte       TCGA-61-1916-02A-01D          Item is noncanonical    Notification              TCGA-61-1916        TCGA-61-1916-02A                                20
OV     Analyte       TCGA-61-1916-11A-01D          Item is noncanonical    Notification              TCGA-61-1916        TCGA-61-1916-11A                                20
OV     Analyte       TCGA-61-1916-01A-01G          Item is noncanonical    Notification              TCGA-61-1916        TCGA-61-1916-01A                                20
OV     Analyte       TCGA-61-1916-02A-01W          Item is noncanonical    Notification              TCGA-61-1916        TCGA-61-1916-02A                                20
OV     Analyte       TCGA-61-1916-02A-01G          Item is noncanonical    Notification              TCGA-61-1916        TCGA-61-1916-02A                                20
OV     Analyte       TCGA-61-1916-11A-01W          Item is noncanonical    Notification              TCGA-61-1916        TCGA-61-1916-11A                                20
OV     Analyte       TCGA-61-1916-01A-01T          Item is noncanonical    Notification              TCGA-61-1916        TCGA-61-1916-01A                                20
OV     Analyte       TCGA-61-1916-01A-01W          Item is noncanonical    Notification              TCGA-61-1916        TCGA-61-1916-01A                                20
OV     Slide         TCGA-61-1916-01A-21-1559      Item is noncanonical    Notification              TCGA-61-1916        TCGA-61-1916-01A                                24
OV     Aliquot       TCGA-61-1916-02A-01R-0808-01  General                 Observation               TCGA-61-1916        TCGA-61-1916-02A  TCGA-61-1916-02A-01R-0808-01  28
OV     Aliquot       TCGA-61-1916-01A-01D-0803-01  Item flagged DNU        CenterNotification        TCGA-61-1916        TCGA-61-1916-01A  TCGA-61-1916-01A-01D-0803-01  28

(rows: 16, time: 0.8s, 727KB processed, job: job_Kms6IP7VWumzFlDh5X2qoeWxBBg)

In [9]:
%%sql

SELECT
 Study,
 itemTypeName,
 itemBarcode,
 annotationCategoryName,
 annotationClassification,
 ParticipantBarcode,
 SampleBarcode,
 AliquotBarcode,
 LENGTH(itemBarcode) AS n
FROM
  $annotations_BQtable
WHERE
  ( ParticipantBarcode="TCGA-GN-A261" )
ORDER BY n ASC


Out[9]:
Study  itemTypeName  itemBarcode   annotationCategoryName         annotationClassification  ParticipantBarcode  SampleBarcode  AliquotBarcode  n
SKCM   Patient       TCGA-GN-A261  Tumor tissue origin incorrect  Redaction                 TCGA-GN-A261                                       12
SKCM   Patient       TCGA-GN-A261  Neoadjuvant therapy            Notification              TCGA-GN-A261                                       12

(rows: 2, time: 1.0s, 727KB processed, job: job_jGPJrgpn_baosYFarH9fRQp-ct4)

As you can see in the results returned from the previous two queries, the SampleBarcode and the AliquotBarcode fields may or may not be filled in, depending on the itemTypeName.


In [10]:
%%sql

SELECT
 Study,
 itemTypeName,
 itemBarcode,
 annotationCategoryName,
 annotationClassification,
 annotationNoteText,
 ParticipantBarcode,
 SampleBarcode,
 AliquotBarcode,
 LENGTH(itemBarcode) AS n
FROM
  $annotations_BQtable
WHERE
  ( ParticipantBarcode="TCGA-RS-A6TP" )
ORDER BY n ASC


Out[10]:
Study:                     HNSC
itemTypeName:              Analyte
itemBarcode:               TCGA-RS-A6TP-10A-01D
annotationCategoryName:    General
annotationClassification:  Observation
annotationNoteText:        DNA analyte UUID: 8304F61F-C217-4B9F-BA64-6486DA54E6C8 was involved in an extraction protocol deviation wherein an additional column purification step was used as a means of buffer exchange on the column-eluted analyte.
ParticipantBarcode:        TCGA-RS-A6TP
SampleBarcode:             TCGA-RS-A6TP-10A
AliquotBarcode:
n:                         20

(rows: 1, time: 1.1s, 1MB processed, job: job_X_hNrE1FA0XESLN6cFEhRUYznWQ)

In this example, there is just one annotation relevant to this particular patient, and one has to look at the annotationNoteText to find out what the potential issue may be with this particular analyte. Any aliquots derived from this blood-normal analyte might need to be used with care.
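
One way to find the aliquots that fall under an annotated item is to compare barcodes, since the barcode of a derived item extends the barcode of its parent. The sketch below is plain Python and rests on that assumption, which holds for the barcodes shown in this notebook but is not guaranteed by the table itself. The function name aliquots_covered is hypothetical, and the barcodes are taken from the TCGA-61-1916 output above.

```python
def aliquots_covered(item_barcode, aliquot_barcodes):
    """Return the aliquots that are, or were derived from, the annotated item.

    Assumption: barcodes are hierarchical with fixed-width fields, so a
    derived item's barcode starts with its parent's barcode.
    """
    return [b for b in aliquot_barcodes if b.startswith(item_barcode)]


# Aliquot barcodes taken from the TCGA-61-1916 query output above.
aliquots = [
    "TCGA-61-1916-02A-01R-0808-01",
    "TCGA-61-1916-01A-01D-0803-01",
]

# Analyte-level annotation: only the aliquot derived from that analyte.
print(aliquots_covered("TCGA-61-1916-01A-01D", aliquots))

# Patient-level annotation: every aliquot from that patient.
print(aliquots_covered("TCGA-61-1916", aliquots))
```

Whether a covered aliquot should actually be excluded still depends on the annotation's classification, category, and note text.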

